fix(efcore): support DTFx distributed tracing without breaking execution #108

Merged
lucaslorentz merged 2 commits into lucaslorentz:main from ooraini:main on Jul 3, 2026

@ooraini ooraini commented Jul 2, 2026

Contributor

Problem

Enabling the Durable Task Framework's built-in distributed tracing
(CorrelationSettings.Current.EnableDistributedTracing = true,
Protocol.W3CTraceContext) against the EFCore backend doesn't just fragment
spans — it breaks orchestration execution entirely. A trivial saga (client
starts an orchestration that schedules one activity) hangs and never
completes; the worker polls the database forever. With tracing off it
completes in well under a second.
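
For reference, the two settings named above can be switched on with a fragment like the following. This is a minimal sketch, assuming both types live in the DurableTask.Core.Settings namespace of Microsoft.Azure.DurableTask.Core; where the assignment happens (application startup here) is also an assumption, not something this PR prescribes.

```csharp
using DurableTask.Core.Settings; // assumed namespace for CorrelationSettings and Protocol

// CorrelationSettings.Current is a process-wide static, so this sketch sets it
// once, before any worker or client is started (placement is an assumption).
CorrelationSettings.Current.EnableDistributedTracing = true;
CorrelationSettings.Current.Protocol = Protocol.W3CTraceContext;
```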

Root cause

The EFCore backend never populates TaskOrchestrationWorkItem.TraceContext.

When tracing is enabled, DurableTask.Core.TaskOrchestrationDispatcher seeds an
AsyncLocal from the work item:

CorrelationTraceContext.Current = workItem.TraceContext; // null from EFCore

and later dereferences it unconditionally:

runtimeState.ExecutionStartedEvent.Correlation =
    CorrelationTraceContext.Current.SerializableTraceContext; // NullReferenceException

The resulting NullReferenceException aborts the work item, which is then
re-fetched and re-thrown forever — the observed infinite polling / hang. Note
this is the framework's legacy App-Insights correlation path, which executes
even when the protocol is W3CTraceContext, so it trips as soon as tracing is
turned on.

The client-side create_orchestration span still works because it captures
Activity.Current before storage; the worker-side orchestration and
activity spans never appear because the work item never processes
successfully.
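
The abort-and-retry loop described above can be modelled without the framework. The sketch below uses hypothetical stand-in types (FakeTraceContext, FakeWorkItem, FailureLoopSketch); none of them are DurableTask.Core types, and the real dispatcher does considerably more. It only shows why a null trace context turns every polling attempt into a failure.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical stand-ins for illustration; not the DurableTask.Core types.
public sealed class FakeTraceContext
{
    public string Serializable => "ctx";
}

public sealed class FakeWorkItem
{
    public FakeTraceContext? TraceContext;
    public bool Completed;
}

public static class FailureLoopSketch
{
    private static readonly AsyncLocal<FakeTraceContext?> Current = new();

    // Mirrors the two dispatcher steps quoted above: seed, then dereference.
    public static void ProcessOnce(FakeWorkItem item)
    {
        Current.Value = item.TraceContext;
        _ = Current.Value!.Serializable; // throws when TraceContext is null
        item.Completed = true;
    }

    // Re-fetches the same item up to maxAttempts times and returns how many
    // attempts failed. The real worker has no such cap, hence the hang.
    public static int Poll(FakeWorkItem item, int maxAttempts)
    {
        int failures = 0;
        for (int i = 0; i < maxAttempts && !item.Completed; i++)
        {
            try { ProcessOnce(item); }
            catch (NullReferenceException) { failures++; } // work item aborted
        }
        return failures;
    }
}
```

Calling Poll with a work item whose TraceContext is null fails on every attempt and never sets Completed; giving the item any non-null context lets the first attempt succeed.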

Fix

In LockNextTaskOrchestrationWorkItemAsync, restore the work item's trace
context from the ExecutionStartedEvent's correlation payload, mirroring the
reference backends (Azure Storage, MSSQL):

private static void AttachTraceContext(TaskOrchestrationWorkItem workItem)
{
    if (!CorrelationSettings.Current.EnableDistributedTracing)
        return; // zero-overhead tracing-off default

    var correlation = workItem.OrchestrationRuntimeState?.ExecutionStartedEvent?.Correlation;
    workItem.TraceContext = TraceContextBase.Restore(correlation);
}

TraceContextBase.Restore(null) returns a valid empty context, and on later
turns it restores the correlation the dispatcher persisted onto
ExecutionStartedEvent.Correlation — giving proper cross-turn continuity. The
guard keeps the tracing-off default allocation-free and behaviourally identical.

Notably no schema/serialization changes

The W3C span carriers — ExecutionStartedEvent.ParentTraceContext and
TaskScheduledEvent.ParentTraceContext — already round-trip through the
Newtonsoft TypelessJsonDataConverter (they're plain public properties). Once
the null TraceContext NRE is removed, the OpenTelemetry spans connect on their
own. No new columns, no model changes, no serializer changes. Activity work
items need nothing extra (the activity dispatcher already reads
workItem.TraceContextBase?. null-safely).

Tests

Adds a distributed-tracing acceptance test following the existing per-storage
convention (base + InMemory/Postgres/SqlServer/MySql variants). It runs the
saga under a root System.Diagnostics.Activity with an ActivityListener on
"DurableTask.Core" and asserts:

  • (a) the orchestration completes with tracing on, and
  • (b) create_orchestration → orchestration → activity spans all share the
    caller's root trace id.
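
The shape of assertion (b) can be sketched with System.Diagnostics alone. This is not the PR's test: the ActivitySource below is a stand-in that emits three nested spans under the "DurableTask.Core" name, and TraceIdAssertionSketch and Capture are hypothetical names. It only illustrates how an ActivityListener collects spans and how their trace ids compare with the root Activity.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class TraceIdAssertionSketch
{
    // Runs three nested stand-in spans under a root Activity and returns the
    // root trace id plus (name, trace id) for every span the listener saw.
    public static (string RootTraceId, List<(string Name, string TraceId)> Spans) Capture()
    {
        var spans = new List<(string Name, string TraceId)>();

        using var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == "DurableTask.Core",
            Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = a =>
            {
                lock (spans) spans.Add((a.OperationName, a.TraceId.ToString()));
            },
        };
        ActivitySource.AddActivityListener(listener);

        // Stand-in for the framework's source; the real test runs the saga.
        using var source = new ActivitySource("DurableTask.Core");

        using var root = new Activity("test-root").SetIdFormat(ActivityIdFormat.W3C).Start();
        using (source.StartActivity("create_orchestration"))
        using (source.StartActivity("orchestration"))
        using (source.StartActivity("activity"))
        {
        }

        return (root.TraceId.ToString(), spans);
    }
}
```

A test would then check that three spans were captured and that each span's TraceId equals RootTraceId.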

The test class runs in a DisableParallelization collection because
CorrelationSettings.Current is a process-wide static.
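
In xUnit, a non-parallel collection is usually declared along these lines. The collection and class names are hypothetical and may differ from the ones in this PR; only the two attributes are xUnit API.

```csharp
using Xunit;

// Hypothetical names; the PR's actual collection and test classes may differ.
[CollectionDefinition("DistributedTracing", DisableParallelization = true)]
public class DistributedTracingCollection
{
}

[Collection("DistributedTracing")]
public class DistributedTracingTests
{
    // Tests that flip CorrelationSettings.Current belong in this collection,
    // so they never run alongside tests that expect tracing to be off.
}
```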

Verification

  • Acceptance test: fails (hang + NRE) before the fix, green after —
    confirmed on InMemory and real Postgres (all three spans on one trace
    id, execution completes).
  • No regression: 33/33 InMemory tests and 18/18 Postgres storage tests
    pass with tracing off; full solution builds with 0 warnings/0 errors on
    net8.0/net9.0/net10.0.

Fixed against Microsoft.Azure.DurableTask.Core 3.7.0 (works within its tracing
model; no newer core required).

🤖 Generated with Claude Code

ooraini and others added 2 commits July 2, 2026 19:51
Enabling DurableTask.Core distributed tracing
(CorrelationSettings.Current.EnableDistributedTracing = true) hung every
orchestration on the EFCore backend instead of merely fragmenting spans.

Root cause: the backend never populated TaskOrchestrationWorkItem.TraceContext.
When tracing is on, TaskOrchestrationDispatcher seeds
CorrelationTraceContext.Current from workItem.TraceContext (null) and then
dereferences it unconditionally
(ExecutionStartedEvent.Correlation = CorrelationTraceContext.Current.SerializableTraceContext),
throwing a NullReferenceException. The work item is aborted and retried
forever, so the orchestration never completes. This legacy App-Insights
correlation path runs even under the W3CTraceContext protocol.

Fix: in LockNextTaskOrchestrationWorkItemAsync, restore the work item's trace
context from the ExecutionStartedEvent's correlation payload
(TraceContextBase.Restore, which returns a valid empty context when there is
none), mirroring the reference backends. Guarded on EnableDistributedTracing so
the tracing-off default stays zero-overhead.

The W3C span carriers (ExecutionStartedEvent.ParentTraceContext /
TaskScheduledEvent.ParentTraceContext) already round-trip through the
Newtonsoft serializer, so no schema, column, or serialization changes are
needed — the entire bug was the null TraceContext.

Adds an acceptance test (InMemory + Postgres + SqlServer + MySql variants)
asserting execution completes and that client -> orchestration -> activity
spans all share the caller's root trace id.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lucaslorentz

Owner

Thanks for the improvement!

@lucaslorentz

Owner

LGTM.

@lucaslorentz lucaslorentz merged commit f8184ce into lucaslorentz:main Jul 3, 2026
2 checks passed